Prosper loan data analysis by Eric Persson

In this exploratory data analysis we look at loan listing data from a web service called Prosper to figure out who is using the service, why they are taking out loans, and what eventually happens to those loans.

Since the original data set contains over 80 variables, I have picked out a subset to use in the analysis based on the questions stated above. Some light data wrangling was also done up front, both to make the data set more readable and to handle NA values.

Univariate Plots Section

Let’s start by having a look at the summary statistics for the data to see what we have to work with.

##  ListingCreationDate                Term      
##  Min.   :2005-11-09 20:44:28   Min.   :12.00  
##  1st Qu.:2008-09-19 10:02:14   1st Qu.:36.00  
##  Median :2012-06-16 12:37:19   Median :36.00  
##  Mean   :2011-07-09 08:07:23   Mean   :40.83  
##  3rd Qu.:2013-09-09 19:40:48   3rd Qu.:36.00  
##  Max.   :2014-03-10 12:20:53   Max.   :60.00  
##                                               
##                 LoanStatus      ClosedDate                 
##  Current             :56576   Min.   :2005-11-25 00:00:00  
##  Completed           :38074   1st Qu.:2009-07-14 00:00:00  
##  Chargedoff          :11992   Median :2011-04-05 00:00:00  
##  Defaulted           : 5018   Mean   :2011-03-07 20:21:21  
##  Past Due (1-15 days):  806   3rd Qu.:2013-01-30 00:00:00  
##  (Other)             : 1266   Max.   :2014-03-10 00:00:00  
##  NA's                :  205   NA's   :58848                
##   BorrowerRate     Occupation        EmploymentStatus  
##  Min.   :0.0000   Length:113937      Length:113937     
##  1st Qu.:0.1340   Class :character   Class :character  
##  Median :0.1840   Mode  :character   Mode  :character  
##  Mean   :0.1928                                        
##  3rd Qu.:0.2500                                        
##  Max.   :0.4975                                        
##                                                        
##  EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
##  Min.   :  0.00           Mode :logical       Mode :logical   
##  1st Qu.: 19.00           FALSE:56459         FALSE:101218    
##  Median : 60.00           TRUE :57478         TRUE :12719     
##  Mean   : 89.64                                               
##  3rd Qu.:130.00                                               
##  Max.   :755.00                                               
##                                                               
##  DebtToIncomeRatio         IncomeRange    TotalProsperLoans
##  Min.   : 0.000    $25,000-49,999:32192   Min.   :0.0000   
##  1st Qu.: 0.140    $50,000-74,999:31050   1st Qu.:0.0000   
##  Median : 0.220    $100,000+     :17337   Median :0.0000   
##  Mean   : 0.276    $75,000-99,999:16916   Mean   :0.2755   
##  3rd Qu.: 0.320    Not displayed : 7741   3rd Qu.:0.0000   
##  Max.   :10.010    $1-24,999     : 7274   Max.   :8.0000   
##  NA's   :8554      (Other)       : 1427                    
##  LoanOriginalAmount LoanOriginationDate           MonthlyLoanPayment
##  Min.   : 1000      Min.   :2005-11-15 00:00:00   Min.   :   0.0    
##  1st Qu.: 4000      1st Qu.:2008-10-02 00:00:00   1st Qu.: 131.6    
##  Median : 6500      Median :2012-06-26 00:00:00   Median : 217.7    
##  Mean   : 8337      Mean   :2011-07-21 03:18:19   Mean   : 272.5    
##  3rd Qu.:12000      3rd Qu.:2013-09-18 00:00:00   3rd Qu.: 371.6    
##  Max.   :35000      Max.   :2014-03-12 00:00:00   Max.   :2251.5    
##                                                                     
##  InvestmentFromFriendsCount InvestmentFromFriendsAmount   Investors      
##  Min.   : 0.00000           Min.   :    0.00            Min.   :   1.00  
##  1st Qu.: 0.00000           1st Qu.:    0.00            1st Qu.:   2.00  
##  Median : 0.00000           Median :    0.00            Median :  44.00  
##  Mean   : 0.02346           Mean   :   16.55            Mean   :  80.48  
##  3rd Qu.: 0.00000           3rd Qu.:    0.00            3rd Qu.: 115.00  
##  Max.   :33.00000           Max.   :25000.00            Max.   :1189.00  
##                                                                          
##  ListingCategory   
##  Length:113937     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Based on the numerical summary above, our typical loan taker is a first-time Prosper user who is about equally likely to be a homeowner or not, taking out a 36-month loan (the median term; the mean is 40.8 months) at an interest rate of around 18 to 19% (median 18.4%, mean 19.3%). The median loan amount is $6,500.

Next we plot the numbers of occurrences for the nominal variables in our data set.
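
A minimal sketch of how such counts could be tabulated and plotted. The data frame name `loans`, the helper `count_levels`, and the toy rows are hypothetical stand-ins; only the column names come from the summary above.

```r
# Hypothetical toy data standing in for the Prosper subset; the column
# names match the summary above, but the rows are invented.
loans <- data.frame(
  EmploymentStatus = c("Employed", "Employed", "Self-employed",
                       "Employed", "Retired"),
  ListingCategory  = c("Debt Consolidation", "Other",
                       "Debt Consolidation", "Home Improvement",
                       "Debt Consolidation"),
  stringsAsFactors = FALSE
)

# Count the occurrences of each level, most frequent first.
count_levels <- function(x) {
  sort(table(x, useNA = "ifany"), decreasing = TRUE)
}

status_counts   <- count_levels(loans$EmploymentStatus)
category_counts <- count_levels(loans$ListingCategory)
print(status_counts)
print(category_counts)

# Optional bar chart, drawn only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(loans, ggplot2::aes(x = ListingCategory)) +
    ggplot2::geom_bar() +
    ggplot2::coord_flip()
  print(p)
}
```

Sorting the table before plotting makes the ranking of categories easy to read off, which is what the discussion of the top categories relies on.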

From the above plots we can see that most of the loan takers are employed, but the type of occupation is seldom specified: the vague “Professional” and “Other” occupation types are both in the top ten. About half of the loans are still being repaid, but there is also a substantial number of past due, defaulted, or charged-off loans.

Further, the income ranges look roughly normally distributed, with an expected value somewhere around $50,000. Lastly, looking at the listing categories, we can see that roughly half of the reasons given for loans through Prosper are debt consolidation, followed by the rather vague “Not Available” and “Other” categories to round out the top three.

Now that we have had an initial look at all the variables in the data set, let’s revisit and plot some of the more interesting numerical variables to see how they are distributed.

Plotting these values revealed some interesting facts: the loan term appears to be fixed at one, three, or five years, and the amounts borrowed cluster around multiples of $5,000, with a maximum of $35,000. Lastly, the loan origination dates clearly show the effects of the 2008 recession and also a peak in new loans in later years, which we are not yet able to explain. Let’s proceed with some further analysis before investigating inter-variable relations in the bivariate section.
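
Both observations about terms and loan amounts can be checked numerically as well as visually. The sketch below assumes a data frame called `loans` (a hypothetical name) with the `Term` and `LoanOriginalAmount` columns from the summary; the values are invented.

```r
# Hypothetical toy values; the real columns are Term and
# LoanOriginalAmount from the summary above.
loans <- data.frame(
  Term               = c(12, 36, 36, 60, 36, 36),
  LoanOriginalAmount = c(1000, 5000, 10000, 15000, 6500, 35000)
)

# Distinct terms in months: only a few values suggests fixed products.
term_counts <- table(loans$Term)
print(term_counts)

# Share of loans whose amount is an exact multiple of $5,000.
round_share <- mean(loans$LoanOriginalAmount %% 5000 == 0)
print(round_share)
```

A high share of exact multiples would support the impression from the histogram that borrowers tend to ask for round amounts.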

Univariate Analysis

What is the structure of your dataset?

The original data set contains more variables than we could practically cover in one go, so to narrow them down we posed some questions about the data. Let’s revisit these questions to see if we have already made any discoveries worth noting.

First, to see who is using the service we can look at the following variables:

  • Employment Status
  • Occupations
  • Income Ranges
  • Is Borrower Homeowner
  • etc.

Based on the summary data and plots presented, our typical loan taker is employed, with an unspecified occupation, and probably has an income of $25,000 to $50,000. They are marginally more likely than not to be a homeowner (about 50.4% are) and have a median debt-to-income ratio of 0.22.

The reason for the loan is most likely debt consolidation, with home improvement and business as distant runners-up among the specified reasons, as seen in the bar chart of ranked listing categories.

To see what eventually happened to the loans, we can look at the loan status bar plot, which gives an overview of the different statuses for all the loans in the data set. Out of 113,937 loan listings, about 17,000 have defaulted (5,018) or been charged off (11,992), where charged off means more than 150 days overdue with no reasonable expectation of sufficient payment.
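
As a quick arithmetic check, the share of defaulted and charged-off loans can be worked out from the `LoanStatus` counts printed in the summary. The variable names below are only illustrative.

```r
# Counts copied from the LoanStatus column of the summary above.
status_counts  <- c(Chargedoff = 11992, Defaulted = 5018)
total_listings <- 113937

bad_loans <- sum(status_counts)
bad_share <- bad_loans / total_listings
cat(bad_loans, "loans,", round(100 * bad_share, 1), "% of listings\n")
```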

What is/are the main feature(s) of interest in your dataset?

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Loan origination dates versus closed dates, together with the data on overdue, defaulted, or charged-off loans. Using these features together with the active listings variable described below, it would be interesting to dive deeper and see how, and when, the 2008 recession affected the loans taken. Further, since the service seems to have enjoyed explosive growth going into 2014, it would be very interesting to see what these added loans are and perhaps why they have increased.

Another feature worth looking into is the borrower rate, which piques my interest mainly because of the unclear shape of its distribution. Investigating which variables are correlated with it, and how they affect this blob of values centered somewhere around 0.2, would be very interesting and perhaps a good candidate for a regression model and analysis.

Did you create any new variables from existing variables in the dataset?

Using the loan origination dates together with the closed dates, I calculated a new variable called active listings to show the volume of current loans on the service. The calculation was made by taking the difference between the number of originated and closed loans for each date between 2005 and 2014 and then calculating the cumulative sum of those differences.
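
The calculation can be sketched roughly as follows. This is an illustration under assumptions, not the exact code used for the report: the data frame `loans`, the helper `count_on`, and the toy dates are hypothetical, and a missing `ClosedDate` is taken to mean the loan is still open.

```r
# Hypothetical toy data; NA in ClosedDate means the loan is still open.
loans <- data.frame(
  LoanOriginationDate = as.Date(c("2008-01-01", "2008-01-01",
                                  "2008-01-02", "2008-01-04")),
  ClosedDate          = as.Date(c("2008-01-03", NA,
                                  "2008-01-04", NA))
)

# One entry per calendar day in the period covered by the data.
days <- seq(min(loans$LoanOriginationDate),
            max(c(loans$LoanOriginationDate, loans$ClosedDate),
                na.rm = TRUE),
            by = "day")

# Number of events on each day (zero where nothing happened).
count_on <- function(dates) {
  as.integer(table(factor(as.character(dates),
                          levels = as.character(days))))
}
opened <- count_on(loans$LoanOriginationDate)
closed <- count_on(loans$ClosedDate)

# Active listings = running total of (opened - closed).
active <- data.frame(date = days,
                     active_listings = cumsum(opened - closed))
print(active)
```

Counting against a complete sequence of days keeps days without any activity in the series, so the cumulative sum lines up with the calendar.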

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

In addition to making sure that all variables were read in using the correct data type, I also chose how to handle NA values. NA values in nominal data were replaced with the best-suited preexisting category to improve the readability of the following plots. For ordinal variables where NAs were present, the missing values were set to 0 in order to ease calculations in the analysis, with the aim of not influencing other statistical measures.
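
A minimal sketch of this kind of NA handling, assuming a data frame named `loans` (hypothetical). Which existing category stands in for a missing value is also an assumption here; “Other” is used purely for illustration.

```r
# Hypothetical toy data; the replacement category is an assumption.
loans <- data.frame(
  Occupation        = c("Teacher", NA, "Other", NA),
  TotalProsperLoans = c(1, NA, 0, 3),
  stringsAsFactors  = FALSE
)

# Nominal variable: fold NA into an existing catch-all category.
loans$Occupation[is.na(loans$Occupation)] <- "Other"

# Count variable: treat a missing count as zero prior loans.
loans$TotalProsperLoans[is.na(loans$TotalProsperLoans)] <- 0

print(loans)
```

Replacing NA with 0 does shift summaries such as the mean, so it may be worth comparing against `mean(x, na.rm = TRUE)` on the untouched column before relying on them.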

Bivariate Plots Section

To follow up on our findings regarding the effects of the 2008 recession on active loan volumes, let’s plot the change in loan volumes for the entire period. We will use monthly deltas to make the graph easier to read.
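
One way to get from daily changes to monthly deltas is to label each day with its month and sum within each month. The sketch below uses base R; the `daily` data frame and its values are hypothetical.

```r
# Hypothetical daily net changes (loans opened minus loans closed).
daily <- data.frame(
  date  = as.Date(c("2008-01-10", "2008-01-25",
                    "2008-02-03", "2008-03-15")),
  delta = c(5, -2, -4, 3)
)

# Label each day with its month and sum the net change per month.
daily$month <- format(daily$date, "%Y-%m")
monthly <- aggregate(delta ~ month, data = daily, FUN = sum)
print(monthly)
```

Summing per month smooths out day-to-day noise, and a negative monthly delta means more loans were closed than originated in that month.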

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the data set?

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

What was the strongest relationship you found?

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection
